Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
Large Language Models (LLMs) have demonstrated remarkable capabilities in
various NLP tasks. However, prior work has shown that these models are
sensitive to prompt wording and to few-shot demonstrations and their order,
posing challenges to the fair assessment of these models. As these models become
more powerful, it becomes imperative to understand and address these
limitations. In this paper, we focus on LLMs' robustness on the task of
multiple-choice questions -- a commonly adopted task for studying the reasoning
and fact-retrieval capabilities of LLMs. Investigating the sensitivity of LLMs
towards the order of options in multiple-choice questions, we demonstrate a
considerable performance gap of approximately 13% to 75% in LLMs on different
benchmarks, when answer options are reordered, even when using demonstrations
in a few-shot setting. Through a detailed analysis, we conjecture that this
sensitivity arises when LLMs are uncertain about the prediction among the
top-2/3 choices, and positional bias may lead specific option placements to
favor one of those top choices over another, depending on the question.
We also identify patterns in top-2 choices that amplify or mitigate the model's
bias toward option placement. We find that to amplify bias, the optimal
strategy is to place the top two choices as the first and last options;
conversely, to mitigate bias, we recommend placing them in adjacent positions.
To validate our conjecture, we conduct various experiments and adopt two
approaches to calibrate LLMs' predictions, leading to improvements of up to 8
percentage points across different models and benchmarks.
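The conjectured mechanism, a near-tie between the top two choices combined with a positional bias, can be illustrated with a toy sketch. All scores and the position bonus below are invented for illustration and do not come from the paper's models:

```python
from itertools import permutations

# Hypothetical per-option content scores standing in for an LLM's preference
# for each answer; the bonus simulates a positional bias toward the
# first-listed option. A near-tie between the top two makes the prediction
# order-dependent.
CONTENT_SCORE = {"Paris": 0.50, "Lyon": 0.48, "Oslo": 0.10}
POSITION_BONUS = 0.05  # added to whichever option is listed first

def predict(options):
    """Pick the option with the highest content score plus position bonus."""
    scores = [CONTENT_SCORE[opt] + (POSITION_BONUS if pos == 0 else 0.0)
              for pos, opt in enumerate(options)]
    return options[scores.index(max(scores))]

def order_sensitivity(options):
    """Fraction of orderings whose prediction differs from the original order's."""
    base = predict(list(options))
    orders = list(permutations(options))
    flips = sum(predict(list(o)) != base for o in orders)
    return flips / len(orders)

print(order_sensitivity(["Paris", "Lyon", "Oslo"]))  # nonzero: the answer flips for some orderings
```

With a wide score gap between the top two options the bonus never changes the winner and the sensitivity drops to zero, matching the abstract's conjecture that the effect concentrates on uncertain, near-tied questions.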
Quantifying Social Biases Using Templates is Unreliable
Recently, there has been an increase in efforts to understand how large
language models (LLMs) propagate and amplify social biases. Several works have
utilized templates for fairness evaluation, which allow researchers to quantify
social biases in the absence of test sets with protected attribute labels.
While template evaluation can be a convenient and helpful diagnostic tool to
understand model deficiencies, it often uses a simplistic and limited set of
templates. In this paper, we study whether bias measurements are sensitive to
the choice of templates used for benchmarking. Specifically, we investigate the
instability of bias measurements by manually modifying templates proposed in
previous works in a semantically-preserving manner and measuring bias across
these modifications. We find that bias values and resulting conclusions vary
considerably across template modifications on four tasks, ranging from an 81%
reduction (NLI) to a 162% increase (MLM) in (task-specific) bias measurements.
Our results indicate that quantifying fairness in LLMs, as done in current
practice, can be brittle and needs to be approached with more care and caution.
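The measurement procedure, computing a bias score per template variant and comparing across semantically-preserving modifications, can be sketched as follows. The `score_fn` here is an arbitrary deterministic stand-in for a model-derived score (such as an MLM pseudo-log-likelihood), and the templates are invented examples, not the ones studied in the paper:

```python
def score_fn(sentence):
    # Stand-in for a model score; a real setup would query an LLM or MLM here.
    return sum(ord(c) for c in sentence) / len(sentence)

def template_biases(templates, group_a, group_b):
    """Per-template bias: score gap between the two group substitutions."""
    return [score_fn(t.format(group=group_a)) - score_fn(t.format(group=group_b))
            for t in templates]

def relative_changes(biases):
    """Change of each bias estimate relative to the first (reference) template."""
    ref = biases[0]
    return [(b - ref) / abs(ref) for b in biases]

templates = [
    "The {group} is a doctor.",
    "I think the {group} works as a doctor.",
    "{group} has a job as a doctor.",
]
biases = template_biases(templates, "woman", "man")
print(relative_changes(biases))
```

Large relative changes across wording variants that preserve meaning are exactly the instability the abstract reports, e.g. an 81% reduction or 162% increase depending on the template chosen.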
Distilling Large Language Models using Skill-Occupation Graph Context for HR-Related Tasks
Numerous HR applications are centered around resumes and job descriptions.
While they can benefit from advancements in NLP, particularly large language
models, their real-world adoption faces challenges due to the absence of
comprehensive benchmarks for various HR tasks and the lack of smaller models with
competitive capabilities. In this paper, we aim to bridge this gap by
introducing the Resume-Job Description Benchmark (RJDB). We meticulously craft
this benchmark to cater to a wide array of HR tasks, including matching and
explaining resumes to job descriptions, extracting skills and experiences from
resumes, and editing resumes. To create this benchmark, we propose to distill
domain-specific knowledge from a large language model (LLM). We rely on a
curated skill-occupation graph to ensure diversity and to provide context for
LLM generation. Our benchmark includes over 50 thousand triples of job
descriptions, matched resumes, and unmatched resumes. Using RJDB, we train
multiple smaller student models. Our experiments reveal that the student models
achieve near/better performance than the teacher model (GPT-4), affirming the
effectiveness of the benchmark. Additionally, we explore the utility of RJDB on
out-of-distribution data for skill extraction and resume-job description
matching, in zero-shot and weakly supervised settings. We release our datasets
and code to foster further research and industry applications.
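The role of the skill-occupation graph, grounding each generation request in skills linked to an occupation, can be sketched minimally. The graph entries, prompt wording, and function names below are hypothetical stand-ins, not the paper's actual pipeline:

```python
# Toy skill-occupation graph: occupation -> linked skills (illustrative only).
SKILL_GRAPH = {
    "data scientist": ["Python", "statistics", "machine learning"],
    "nurse": ["patient care", "triage", "medication administration"],
}

def graph_context(occupation, k=3):
    """Select up to k skills linked to the occupation as grounding context."""
    skills = SKILL_GRAPH.get(occupation, [])[:k]
    return f"Occupation: {occupation}. Linked skills: {', '.join(skills)}."

def build_prompt(occupation):
    """Compose one generation request for a (job description,
    matched resume, unmatched resume) triple."""
    return (graph_context(occupation) +
            " Write a job description, one matching resume, "
            "and one non-matching resume.")

print(build_prompt("nurse"))
```

Conditioning each request on a different graph neighborhood is one way such a pipeline could keep the 50 thousand generated triples diverse rather than letting the teacher LLM drift toward a few stereotypical occupations.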